NAR Genomics and Bioinformatics

Oxford University Press (OUP)

All preprints, ranked by how well they match NAR Genomics and Bioinformatics's content profile, based on 214 papers previously published here. The average preprint has a 0.11% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Single-cell identity definition using random forests and recursive feature elimination (scRFE)

Park, M.; Vorperian, S.; Wang, S.; Pisco, A. O.

2020-08-04 bioinformatics 10.1101/2020.08.03.233650 medRxiv
Top 0.1%
27.1%

Single-cell RNA sequencing (scRNA-seq) enables the detailed examination of a cell's underlying regulatory networks and the molecular factors contributing to its identity. We developed scRFE with the goal of generating interpretable gene lists that can accurately distinguish observations (single cells) by their features (genes) given a metadata category of the dataset. scRFE is an algorithm that combines the classical random forest classifier with recursive feature elimination and cross validation to find the features necessary and sufficient to classify cells in a single-cell RNA-seq dataset by ranking feature importance. It is implemented as a Python package compatible with Scanpy, enabling its seamless integration into any single-cell data analysis workflow that aims at identifying minimal transcriptional programs relevant to describing metadata features of the dataset. We applied scRFE to the Tabula Muris Senis and reproduced established aging patterns and transcription factor reprogramming protocols, highlighting the biological value of scRFE's learned features. Author summary: scRFE is a Python package that combines a random forest classifier with recursive feature elimination and cross validation to find the features necessary and sufficient to classify cells in a single-cell RNA-seq dataset by ranking feature importance. scRFE was designed to enable straightforward integration as part of any single-cell data analysis workflow that aims at identifying minimal transcriptional programs relevant to describing metadata features of the dataset.

2
A like-for-like comparison of lightweight-mapping pipelines for single-cell RNA-seq data pre-processing

Zakeri, M.; Srivastava, A.; Sarkar, H.; Patro, R.

2021-02-11 bioinformatics 10.1101/2021.02.10.430656 medRxiv
Top 0.1%
26.5%

Recently, Booeshaghi and Pachter (1) published a benchmark comparing the kallisto-bustools pipeline (2) for single-cell data pre-processing to the alevin-fry pipeline (3). Their benchmarking adopted drastically dissimilar configurations for these two tools, and overlooked the time- and space-frugal configurations of alevin-fry previously benchmarked by Sarkar et al. (3). In this manuscript, we provide a small set of modifications to the benchmarking scripts of Booeshaghi and Pachter that are necessary to perform a like-for-like comparison between kallisto-bustools and alevin-fry. We also address some misuses of the alevin-fry commands and include important data on the exact reference transcriptomes used for processing. Using the same benchmarking scripts of Booeshaghi and Pachter (1), we demonstrate that, when configured to match the computational complexity of kallisto-bustools as closely as possible, alevin-fry processes data faster (~2.08 times as fast on average) and uses less peak memory (~0.34 times as much on average) compared to kallisto-bustools, while producing results that are similar when assessed in the manner done by Booeshaghi and Pachter (1). This is a notable inversion of the performance characteristics presented in the previous benchmark.

3
Information-theory-based benchmarking and feature selection algorithm improve cell type annotation and reproducibility of single cell RNA-seq data analysis pipelines

Ren, Z.; Gerlach, M.; Shi, H.; Misharin, A. V.; Budinger, G. S.; Amaral, L. A. N.

2020-11-04 bioinformatics 10.1101/2020.11.02.365510 medRxiv
Top 0.1%
26.2%

Single cell RNA sequencing (scRNA-seq) data are now routinely generated in experimental practice because of their promise to enable the quantitative study of biological processes at the single cell level. However, cell type and cell state annotations remain an important computational challenge in analyzing scRNA-seq data. Here, we report on the development of a benchmark dataset where reference annotations are generated independently from transcriptomic measurements. We used this benchmark to systematically investigate the impact on labelling accuracy of different approaches to feature selection, of different clustering algorithms, and of different sets of parameter values. We show that an approach grounded in information theory can provide a general, reliable, and accurate process for discarding uninformative features and optimizing cluster resolution in single cell RNA-seq data analysis.

4
g.nome, A Transparent Bioinformatics Pipeline that Enables Differential Expression and Alternative Splicing Analysis by Non-Computational Biologists

Corey, D. R.; Bryl, R.; Kang, X.; Wang, T. R.; Kunitomi, M.; Fuhrman, K.; Johnson, K.; Kearns, N.

2025-05-10 bioinformatics 10.1101/2025.05.09.652286 medRxiv
Top 0.1%
25.9%

Reproducibility and accessibility are cardinal principles in the rapidly evolving field of bioinformatics. As the collection of biological data grows, proper use of pipelines to analyze datasets can become a bottleneck restricting efficient analysis. Biologists who collect data and test hypotheses may not have strong computational backgrounds and may not be able to fully understand the underlying strengths and weaknesses of computational approaches or fully exploit their data. Some data may be misunderstood and, perhaps more importantly, critical findings may remain unobserved. High throughput RNA sequencing (RNAseq) has advanced our understanding of transcriptomics across diverse applications. Here we introduce g.nome, a bioinformatics platform that integrates contemporary tools necessary for independent analysis. A user-friendly graphical interface simplifies running jobs and allows simplified analysis of different datasets by non-bioinformaticians. g.nome was used to analyze the consequences of localizing the critical RNAi factor argonaute (AGO) to nuclei of colorectal cancer cell line HCT116. Analysis using the pipeline facilitated the straightforward identification of splicing changes and the prioritization of these splicing changes for validation and further experimental analysis. [Figure 1]

5
A novel phylogenetic analysis combined with a machine learning approach predicts human mitochondrial variant pathogenicity

Akpinar, B. A.; Carlson, P. O.; Dunn, C. D.

2020-01-11 evolutionary biology 10.1101/2020.01.10.902239 medRxiv
Top 0.1%
25.8%

Linking mitochondrial DNA (mtDNA) variation to clinical outcomes remains a formidable challenge. Diagnosis of mitochondrial disease is hampered by the multicopy nature and potential heteroplasmy of the mitochondrial genome, differential distribution of mutant mtDNAs among various tissues, genetic interactions among alleles, and environmental effects. Here, we describe a new approach to the assessment of which mtDNA variants may be pathogenic. Our method takes advantage of site-specific conservation and variant acceptability metrics that minimize previous classification limitations. Using our novel features, we deploy machine learning to predict the pathogenicity of thousands of human mtDNA variants. Our work demonstrates that a substantial fraction of mtDNA changes not yet characterized as harmful are, in fact, likely to be deleterious. Our findings will be of direct relevance to those at risk of mitochondria-associated metabolic disease.

6
Bio informatics: Integrate negative controls to get the good data

van Nues, R. W.

2024-10-09 bioinformatics 10.1101/2024.10.08.617225 medRxiv
Top 0.1%
23.0%

High-throughput datasets, like any experimental output, can be full of noise. Negative controls, i.e. mock experiments not providing information concerning the biological system under study, visualise background. Overlooking this training set of wrong examples in publicly available datasets can seriously undermine validity of bioinformatics analyses. We present a program, COALISPR, for explicit and transparent application of negative control data in the comparison of high-throughput sequencing results. This yields mapping coordinates that guide fast counting of reads, bypassing the need for a reference file, and is especially relevant when small RNA sequencing libraries contaminated with breakdown products are analysed for poorly annotated organisms. We have re-analysed small RNA datasets for mouse and fungus Cryptococcus neoformans, leading to consistent identification of miRNAs and of fungal transcripts targeted by siRNAs. Cryptococcal Argonautes are directed to spliced transcripts indicating that RNAi must be triggered by events downstream of intron removal. Negative control datasets contain large amounts of ribosomal RNA (rRNA) fragments (rRFs). These differ from small RNAs associated with RNAi, making a biological role for rRFs in association with Argonautes unlikely. Background signals enabled identification of cryptococcal genes for RNase P, U1 snRNA, 37 H/ACA and 63 Box C/D snoRNAs, including U3 and U14 essential for pre-rRNA processing. To gain meaning, high-throughput RNA-Seq analyses need to incorporate negative data. Graphical abstract: [Figure 1]

7
What is a differentially expressed gene?

Hoerbst, F.; Sidhu, G. S.; Tomkins, M.; Morris, R. J.

2025-02-01 bioinformatics 10.1101/2025.01.31.635902 medRxiv
Top 0.1%
22.4%

The concept of Differentially Expressed Genes (DEGs) is central to RNA-Seq studies, yet their identification suffers from reproducibility issues. This is largely a consequence of the inherent biological and technical variation that cannot be captured with small numbers of replicates. When thresholds for p-values and log2 fold changes are introduced, this variability can propagate an incomplete description of the data, leading to differing interpretations. Here, we compare traditional binary DEG classification with a rank-based method, grounded in Bayesian statistics, using a published yeast dataset comprising over 40 replicates. This analysis reveals how the choice of thresholds and number of replicates results in discrepancies between studies and potentially interesting genes being overlooked. Furthermore, by comparing wild-type with wild-type samples, we show how variability in gene expression can be mistaken for differential expression. Evaluating current practices for navigating the accuracy-error trade-off in the search for differentially expressed genes leads us to advocate rank-based methods and Bayesian statistics to mitigate the limitations of binary classifications and communicate uncertainty.
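
The contrast between binary DEG calls and a ranked list can be sketched in a few lines. This is only an illustration under stated assumptions: the gene names and values are made up, the helper names `binary_degs` and `rank_by_evidence` are hypothetical, and the p-value is used as a stand-in evidence score rather than the Bayesian quantity the preprint advocates.

```python
from typing import NamedTuple


class GeneResult(NamedTuple):
    gene: str
    p_value: float
    log2_fc: float


# Made-up illustrative values, not data from the preprint.
RESULTS = [
    GeneResult("geneA", 0.001, 2.5),
    GeneResult("geneB", 0.030, 1.1),
    GeneResult("geneC", 0.049, 0.9),
    GeneResult("geneD", 0.200, 3.0),
]


def binary_degs(results, p_cutoff, lfc_cutoff):
    """Return the set of genes that pass both thresholds (a binary DEG call)."""
    return {
        r.gene
        for r in results
        if r.p_value < p_cutoff and abs(r.log2_fc) >= lfc_cutoff
    }


def rank_by_evidence(results):
    """Order genes from strongest to weakest evidence.

    The p-value is an assumed stand-in score; a Bayes factor or similar
    quantity could be substituted without changing the structure.
    """
    return [r.gene for r in sorted(results, key=lambda r: r.p_value)]


strict = binary_degs(RESULTS, p_cutoff=0.01, lfc_cutoff=1.0)
lenient = binary_degs(RESULTS, p_cutoff=0.05, lfc_cutoff=1.0)
ranking = rank_by_evidence(RESULTS)
```

With these toy values the strict and lenient cut-offs return different gene sets, while the ranking is the same regardless of where a threshold is drawn, which is the property the abstract appeals to.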

8
A Bayesian inference tool for identifying artifactual calls from differential transcript abundance analyses

Mangiola, S.; Modrak, M.; Thomas, E.; Papenfuss, A. T.

2020-02-28 bioinformatics 10.1101/2020.02.27.967240 medRxiv
Top 0.1%
19.0%

Relative transcript abundance has proven to be a valuable tool for inferring the phenotype of biological systems from genetic material. Several methods for the analysis of differential transcript abundance have been developed, and some of the most popular are based on negative binomial models. Although most genes are fitted reasonably well by the negative binomial distribution, the presence of outlier observations that do not fit such models can lead to artifactual identification of significant changes in transcription. Identifying those transcripts for the correct interpretation of results is extremely important. A robust and automated tool for detecting sample/transcript pairs that do not fit a negative binomial regression model is currently lacking. Here we propose ppcseq, a robust statistical framework that models hierarchically sample- and gene-wise features such as sequencing depth bias, the association between mean transcript abundance and its over-dispersion, and provides a theoretical transcript abundance distribution, on which the observed transcript abundance can be tested for outliers. Using a publicly available data set, we show that nearly 10% of differentially abundant transcripts had fold changes inflated by the presence of outliers. This method has broad utility in filtering artifactual results of differential transcript abundance analyses based on a negative binomial framework.

9
Manipulating base quality scores enables variant calling from bisulfite sequencing alignments using conventional Bayesian approaches

Nunn, A.; Otto, C.; Stadler, P. F.; Langenberger, D.

2021-01-11 bioinformatics 10.1101/2021.01.11.425926 medRxiv
Top 0.1%
18.9%

Calling germline SNP variants from bisulfite-converted sequencing data poses a challenge for conventional software, which has no inherent capability to dissociate true polymorphisms from artificial mutations induced by the chemical treatment. Nevertheless, SNP data is desirable both for genotyping and to understand the DNA methylome in the context of the genetic background. The confounding effect of bisulfite conversion can be resolved by observing differences in allele counts on a per-strand basis, whereby artificial mutations are reflected by non-complementary base pairs. Herein, we present a computational pre-processing approach for adapting sequence alignment data, thus indirectly enabling downstream analysis in this manner using conventional variant calling software such as GATK or Freebayes. In comparison to specialised tools, the method represents a marked improvement in precision-sensitivity based on high-quality, published benchmark datasets for both human and model plant variants.

10
Benchmarking long-read RNA-sequencing analysis tools using in silico mixtures

Dong, X.; Du, M. R. M.; Gouil, Q.; Tian, L.; Jabbari, J. S.; Bowden, R.; Baldoni, P. L.; Chen, Y.; Smyth, G. K.; Amarasinghe, S. L.; Law, C. W.; Ritchie, M. E.

2023-05-18 bioinformatics 10.1101/2022.07.22.501076 medRxiv
Top 0.1%
18.7%

The current lack of benchmark datasets with inbuilt ground-truth makes it challenging to compare the performance of existing long-read isoform detection and differential expression analysis workflows. Here, we present a benchmark experiment using two human lung adenocarcinoma cell lines that were each profiled in triplicate together with synthetic, spliced, spike-in RNAs ("sequins"). Samples were deeply sequenced on both Illumina short-read and Oxford Nanopore Technologies long-read platforms. Alongside the ground-truth available via the sequins, we created in silico mixture samples to allow performance assessment in the absence of true positives or true negatives. Our results show that StringTie2 and bambu outperformed the other tools among the 6 isoform detection tools tested; DESeq2, edgeR and limma-voom were best among the 5 differential transcript expression tools tested; and there was no clear front-runner for differential transcript usage analysis among the 5 tools compared, which suggests further methods development is needed for this application.

11
Machine learning differentiates between bulk and pseudo-bulk RNA-seq datasets

Low, B. H.; Rashid, M. M.; Selvarajoo, K.

2025-07-01 bioinformatics 10.1101/2025.06.27.661895 medRxiv
Top 0.1%
18.6%

Modern synthetic data generators and deconvolution methods rely heavily on single-cell (sc) RNA-seq data. Aggregated scRNA-seq (pseudo-bulk) is commonly assumed to closely match true bulk RNA-seq, making it a dependable benchmark for developing and evaluating new bioinformatics methods. Here, we investigated paired bulk and scRNA-seq datasets using machine learning techniques to assess the fidelity of pseudo-bulk profiles. Our results demonstrate that pseudo-bulks differ substantially from bulk RNA-seq in both analytic metrics and biological processes.
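
For readers unfamiliar with the term, the following is a minimal sketch of how a pseudo-bulk profile can be built from single-cell counts. It assumes aggregation by summing counts per sample, which is one common choice; the abstract does not say which aggregation the authors used. The function name `pseudo_bulk` and all cell, sample, and gene labels are hypothetical toy values.

```python
from collections import defaultdict


def pseudo_bulk(cell_counts, cell_to_sample):
    """Sum per-cell gene counts into one profile per sample.

    cell_counts:    {cell_id: {gene: count}}
    cell_to_sample: {cell_id: sample_id}
    """
    profiles = defaultdict(lambda: defaultdict(int))
    for cell, counts in cell_counts.items():
        sample = cell_to_sample[cell]
        for gene, n in counts.items():
            profiles[sample][gene] += n
    # Convert nested defaultdicts to plain dicts for a predictable result.
    return {sample: dict(genes) for sample, genes in profiles.items()}


# Toy input, made up for illustration only.
cell_counts = {
    "cell1": {"geneA": 3, "geneB": 0},
    "cell2": {"geneA": 1, "geneB": 4},
    "cell3": {"geneA": 2},
}
cell_to_sample = {"cell1": "s1", "cell2": "s1", "cell3": "s2"}

profiles = pseudo_bulk(cell_counts, cell_to_sample)
```

The resulting per-sample vectors are what would then be compared against a true bulk RNA-seq measurement of the same sample.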

12
Benchmarking tRNA-Seq quantification approaches by realistic tRNA-Seq data simulation identifies two novel approaches with higher accuracy

Smith, T. S.; Monti, M.; Willis, A. E.; Kalmar, L.

2023-12-14 bioinformatics 10.1101/2023.12.13.571582 medRxiv
Top 0.1%
18.4%

Quantification of transfer RNA (tRNA) using Illumina sequencing-based tRNA-Seq is complicated by the high degree of redundancy and extensive modification of tRNAs. As such, no tRNA-Seq method has become well established, while various approaches have been proposed to quantify tRNAs from sequencing reads. Here, we use realistic tRNA-Seq simulations to benchmark tRNA-Seq quantification approaches, including two novel approaches. We demonstrate that these novel approaches are consistently the most accurate, using data simulated to mimic five different tRNA-Seq methods. This simulation-based benchmarking also identifies specific shortfalls for each quantification approach and suggests that up to 13% of the variance observed between cell lines in real tRNA-Seq data could be due to systematic differences in quantification accuracy.

13
Accounting for fragments of unexpected origin improves transcript quantification in RNA-seq simulations focused on increased realism

Srivastava, A.; Zakeri, M.; Sarkar, H.; Soneson, C.; Kingsford, C.; Patro, R.

2021-01-19 bioinformatics 10.1101/2021.01.17.426996 medRxiv
Top 0.1%
18.4%

Transcript and gene quantification is the first step in many RNA-seq analyses. While many factors and properties of experimental RNA-seq data likely contribute to differences in accuracy between various approaches to quantification, it has been demonstrated (1) that quantification accuracy generally benefits from considering, during alignment, potential genomic origins for sequenced fragments that reside outside of the annotated transcriptome. Recently, Varabyou et al. (2) demonstrated that the presence of transcriptional noise leads to systematic errors in the ability of tools -- particularly annotation-based ones -- to accurately estimate transcript expression. Here, we confirm the findings of Varabyou et al. (2) using the simulation framework they have provided. Using the same data, we also examine the methodology of Srivastava et al. (1) as implemented in recent versions of salmon (3), and show that it substantially enhances the accuracy of annotation-based transcript quantification in these data.

14
Analysis of Transcriptograms in Epithelial-Mesenchymal Transition (EMT)

Santos, O. J.; Dalmolin, R. J.; de Almeida, R. M. C.

2026-02-18 bioinformatics 10.64898/2026.02.16.706231 medRxiv
Top 0.1%
18.1%

Single-cell RNA sequencing (single-cell RNA-seq) has represented a revolution in gene expression analysis. However, high dropout rates and stochastic noise often reduce the amount of information captured in these experiments. The epithelial-mesenchymal transition (EMT), which is fundamental to tumor progression and organismal development, is particularly difficult to fully characterize due to the existence of intermediate states. In this work, we demonstrate that projecting transcriptomic data onto gene lists ordered using protein-protein interaction (PPI) information acts as a "biological low-pass filter", attenuating technical noise and increasing the statistical power of the analyses. We propose and validate an innovative pipeline that integrates the Transcriptogram method with Principal Component Analysis (PCA). By applying a moving average over functionally ordered genes, we drastically increase the signal-to-noise ratio, enabling the inference of cellular trajectories. The method was applied to a public dataset of TGF-β1-induced MCF10A cells, with rigorous batch-effect correction based on biological controls. The results reveal that EMT is not merely a morphological change, but a coordinated, systemic reprogramming. This approach enabled the identification of critical modules that would remain hidden in conventional analyses: (i) a massive "Metabolic Switch" (Cluster 2), indicating a transition toward oxidative phosphorylation to sustain invasion; (ii) a strategic blockade of the cell cycle (Cluster 4); and (iii) a "Detoxification Shield" and chemoresistance program (Cluster 5), characterized by endogenous activation of metallothioneins. We conclude that the combination of PPI network topology and dimensionality reduction offers superior resolution for dissecting cellular plasticity. The method not only validates classical markers, but also reveals the hidden functional architecture of the transition, showing that EMT is not a single, uniform process, but rather one in which cells can follow distinct trajectories, halting at different stages of differentiation.
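
The "low-pass filter" idea in this abstract, a moving average over an ordered gene list, can be sketched as follows. This is a minimal illustration, not the Transcriptogram implementation: the function name `moving_average`, the window radius, the truncation of the window at the ends of the list, and the toy expression values are all assumptions made here.

```python
def moving_average(values, radius):
    """Average each position with its neighbours within `radius`.

    The window is truncated at both ends of the list, so edge positions
    are averaged over fewer values.
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - radius)
        hi = min(len(values), i + radius + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Toy expression values along an (assumed) PPI-based gene ordering.
expression = [0.0, 10.0, 0.0, 10.0, 0.0, 10.0]
smoothed = moving_average(expression, radius=1)
```

In this toy case the gene-to-gene swings of the raw profile are strongly damped after smoothing, which is the sense in which averaging over functionally ordered neighbours attenuates noise.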

15
Differential quantification of alternative splicing events on spliced pangenome graphs

Ciccolella, S.; Cozzi, D.; Della Vedova, G.; Kuria, S.; Bonizzoni, P.; Denti, L.

2023-11-07 bioinformatics 10.1101/2023.11.06.565751 medRxiv
Top 0.1%
18.0%

Pangenomes are becoming a powerful framework to perform many bioinformatics analyses taking into account the genetic variability of a population, thus reducing the bias introduced by a single reference genome. With the wider diffusion of pangenomes, integrating genetic variability with transcriptome diversity is becoming a natural extension that demands specific methods for its exploration. In this work, we extend the notion of spliced pangenomes to that of annotated spliced pangenomes; this allows us to introduce a formal definition of Alternative Splicing (AS) events on a graph structure. To investigate the usage of graph pangenomes for the quantification of AS events across conditions, we developed pantas, the first pangenomic method for the detection and differential analysis of AS events from short RNA-Seq reads. A comparison with state-of-the-art linear reference-based approaches proves that pantas achieves competitive accuracy, making spliced pangenomes effective for conducting AS event quantification and opening future directions for the analysis of population-based transcriptomes. pantas is open-source and freely available at github.com/algolab/pantas. Author summary: The ever increasing availability of complete genomes is advancing our comprehension of many biological mechanisms and is enhancing the knowledge we can extract from sequencing data. Pangenome graphs are a convenient way to represent multiple genomes and the genetic variability within a population. Integrating genetic variability with transcriptome diversity can improve our understanding of alternative splicing, a regulation mechanism which allows a single gene to code for multiple proteins. However, many unanswered questions are limiting our comprehension of the relationship between genetic and transcriptomic variations. With this work, we start to fill this gap by introducing pantas, the first approach based on pangenome graphs for the detection and differential quantification of alternative splicing events. A comparison with state-of-the-art approaches based on linear genomes proves that pangenome graphs can be effectively used to perform such an analysis. By integrating genetic and transcriptome variability in a single structure, pantas can pave the way to next generation bioinformatic approaches for the accurate analysis of the relations between genetic variations and alternative splicing aberrations.

16
BRACE: A novel Bayesian-based imputation approach for dimension reduction analysis of alternative splicing at single-cell resolution

Wen, S.

2024-08-05 bioinformatics 10.1101/2024.08.01.606201 medRxiv
Top 0.1%
18.0%

The Bayesian approach is a powerful tool for solving challenging questions in the life sciences. One area in which it has seen increased use in recent years is single-cell biology. Alternative splicing represents an additional layer of complexity underlying gene expression profiles that has the potential to reveal insights into the biological mechanisms underpinning health and disease states. Dimension reduction analysis is the cornerstone of RNA-sequencing analysis and has the ability to guide selection of candidate biomarkers based on segregation of sample groups. Nevertheless, dimension reduction analysis at single-cell resolution remains a significant challenge for alternative splicing datasets, and has therefore hitherto precluded the assessment of candidate isoforms. Here, we introduce BRACE (a Bayesian-based imputation approach for dimension Reduction Analysis of alternative splicing at single-CEll resolution). We demonstrated that our Bayesian approach represents an improvement over existing methods for imputing missing percent spliced-in values, and subsequently applied our approach for the dimension reduction analysis of alternative splicing events at single-cell resolution. We further demonstrated the application of our Bayesian approach over a range of single-cell datasets with increasing complexity, namely cell populations that are transcriptomically distinct, similar, and heterogeneous. We anticipate our approach to enable assessment and selection of cell state- or disease-specific biomarkers for downstream experimental validation.

17
ECSFinder: Optimized prediction of evolutionarily conserved RNA secondary structures from genome sequences

Gaonac'h-Lovejoy, V. A.; Sauvageau, M.; Mattick, J. S.; Smith, M. A.

2024-09-19 bioinformatics 10.1101/2024.09.14.612549 medRxiv
Top 0.1%
17.9%

Accurate prediction of RNA secondary structures is essential for understanding the evolutionary conservation and functional roles of long noncoding RNAs (lncRNAs) across diverse species. In this study, we benchmarked two leading tools for predicting evolutionarily conserved RNA secondary structures (ECSs) -- SISSIz and R-scape -- using two distinct experimental frameworks: one focusing on well-characterized mitochondrial RNA structures and the other on experimentally validated Rfam structures embedded within simulated genome alignments. While both tools performed comparably overall, each displayed subtle preferences in detecting ECSs. To address these limitations, we evaluated two interpretable machine learning approaches that integrate the strengths of both methods. By balancing thermodynamic stability features from RNALalifold and SISSIz with robust covariation metrics from R-scape, a random forest classifier significantly outperformed both conventional tools. This classifier was implemented in ECSfinder, a new tool that provides a robust, interpretable solution for genome-wide identification of conserved RNA structures, offering valuable insights into lncRNA function and evolutionary conservation. ECSfinder is designed for large-scale comparative genomics applications and promises to facilitate the discovery of novel functional RNA elements.

18
GRaNIE and GRaNPA: Inference and evaluation of enhancer-mediated gene regulatory networks applied to study macrophages

Kamal, A.; Arnold, C.; Claringbould, A.; Moussa, R.; Daga, N.; Nogina, D.; Kholmatov, M.; Servaas, N.; Mueller-Dott, S.; Reyes-Palomares, A.; Palla, G.; Sigalova, O.; Bunina, D.; Pabst, C.; Zaugg, J. B.

2022-02-07 genomics 10.1101/2021.12.18.473290 medRxiv
Top 0.1%
17.4%

Among the biggest challenges in the post-GWAS (genome-wide association studies) era is the interpretation of disease-associated genetic variants in non-coding genomic regions. Enhancers have emerged as key players in mediating the effect of genetic variants on complex traits and diseases. Their activity is regulated by a combination of transcription factors (TFs), epigenetic changes and genetic variants. Several approaches exist to link enhancers to their target genes, and others that infer TF-gene connections. However, we currently lack a framework that systematically integrates enhancers into TF-gene regulatory networks. Furthermore, we lack an unbiased way of assessing whether inferred regulatory interactions are biologically meaningful. Here we present two methods, implemented as user-friendly R packages: GRaNIE (Gene Regulatory Network Inference including Enhancers) for building enhancer-based gene regulatory networks (eGRNs) and GRaNPA (Gene Regulatory Network Performance Analysis) for evaluating GRNs. GRaNIE jointly infers TF-enhancer, enhancer-gene and TF-gene interactions by integrating open chromatin data such as ATAC-Seq or H3K27ac with RNA-seq across a set of samples (e.g. individuals), and optionally also Hi-C data. GRaNPA is a general framework for evaluating the biological relevance of TF-gene GRNs by assessing their performance for predicting cell-type specific differential expression. We demonstrate the power of our tool-suite by investigating gene regulatory mechanisms in macrophages that underlie their response to infection and cancer, their involvement in common genetic diseases including autoimmune diseases, and identify the TF PURA as putative regulator of pro-inflammatory macrophage polarisation. Availability: GRaNIE: https://bioconductor.org/packages/release/bioc/html/GRaNIE.html; GRaNPA: https://git.embl.de/grp-zaugg/GRaNPA. Graphical abstract: [Figure 1]

19
Using a Whole Genome Co-expression Network to Inform the Functional Characterisation of Predicted Genomic Elements from Mycobacterium tuberculosis Transcriptomic Data

Stiens, J.; Tan, Y. Y.; Joyce, R.; Arnvig, K. B.; Kendall, S. L.; Nobeli, I.

2022-06-23 bioinformatics 10.1101/2022.06.22.497203 medRxiv
Top 0.1%
17.3%

A whole genome co-expression network was created using Mycobacterium tuberculosis transcriptomic data from publicly available RNA-sequencing experiments covering a wide variety of experimental conditions. The network includes expressed regions with no formal annotation, including putative short RNAs and untranslated regions of expressed transcripts, along with the protein-coding genes. These unannotated expressed transcripts were among the best-connected members of the module sub-networks, making up more than half of the hub elements in modules that include protein-coding genes known to be part of regulatory systems involved in stress response and host adaptation. This dataset provides a valuable resource for investigating the role of non-coding RNA, and conserved hypothetical proteins, in transcriptomic remodelling. Based on their connections to genes with known functional groupings and correlations with replicated host conditions, predicted expressed transcripts can be screened as suitable candidates for further experimental validation.

20
A Bayesian framework for ranking genes based on their statistical evidence for differential expression

Hoerbst, F.; Sidhu, G. S.; Omori, T.; Tomkins, M.; Morris, R. J.

2025-01-22 bioinformatics 10.1101/2025.01.20.633909 medRxiv
Top 0.1%
17.0%

Advances in sequencing technologies have revolutionised our ability to capture the complete RNA profile in tissue samples, allowing for comparative analyses of RNA levels between developmental processes, environmental responses, or treatments. However, quantifying changes in gene expression remains challenging, given inherent biological variability and technological limitations. To address this, we introduce a Bayesian framework for differential gene expression (DGE) analysis. Our framework unifies and streamlines a complex analysis, typically involving parameter estimations and multiple statistical tests, into a concise mathematical equation. This allows statistical evidence for differential expression to be computed rapidly and transparently. We show how this approach can be used to evaluate variability of individual genes between replicates. A comparison of our framework with existing tools revealed differences that can be explained by commonly employed thresholds in other packages. This motivated us to explore ranking genes based on their statistical evidence as opposed to a binary classification as DEGs. Our analysis leads us to advocate the use of Bayes factors within a rank-based approach. This framework offers enhanced computational efficiency and delivers a transparent way to analyse, interpret and communicate DGE results.